ABSTRACT

A brief discussion of the literature concerned with the 
two-population discrimination problem is presented and sev- 
eral procedures based on the likelihood ratio for discrim- 
ination between negative exponentially distributed populations 
are proposed. The small sample and asymptotic performance of 
these procedures is compared with that of non-parametric 
procedures and the classical linear discriminant function. 

Some guidelines for the use of the procedures discussed are 
presented. 
I. INTRODUCTION

The problem of classification arises when one or more 
measurements are made on an individual and one wishes to clas- 
sify the individual as belonging to one of a finite number of 
categories on the basis of these measurements. Each category 
is characterized by a probability distribution of the measurements,
but the proper category of the individual is not observable;
it must be inferred from the measurements. Thus the
problem, in abstract terms, is: given an observation of a
random variable arising from one of several populations, find
a rule for deciding from which population the observation
came.

The classification problem is, then, one of finding an 
appropriate "statistical decision function." We have a number
of hypotheses: each hypothesis is that the distribution
of the observation is that corresponding to a given population,
and one of these hypotheses must be selected, the
others rejected.

In the classification problem, there are essentially
three levels of information about the distributions corresponding
to the various populations which may be available
to the statistician.

1. the distributions may be completely known 

2. the distributions may be known to belong to a 
given family indexed by a parameter which is 
unknown 

3. the distributions may be completely unknown 
In cases 2) and 3), information about the value of the parameter
or about the unknown distribution is usually available
from a sample or sequence of realizations of the random variable
corresponding to each population.

In the investigations reported in this thesis, the in- 
dividual to be classified belongs to one of two populations. 
In this situation, case 1) above is equivalent to the simple 
vs. simple hypothesis testing problem whose solution is given 
by the Neyman-Pearson Lemma. Case 2) has received relatively
little attention except under the assumption that the family
of distributions is multivariate normal with the same (but
unknown) covariance matrix. The distributions of the statistics
arising in this situation have been derived. In addition,
Hoel and Peterson (5) have derived very general conditions
under which procedures using sample estimates of
the parameters are asymptotically optimal. Case 3) was
first considered by Fix and Hodges in 1951.

In Section II of this thesis the non-parametric proce- 
dure proposed by Fix and Hodges (2,3) and the application of 
this procedure when the distribution of the random variables 
is negative exponential will be reviewed. A bound on the 
error probabilities of the Fix-Hodges procedure discovered 
by Cover and Hart (1) and a more general procedure proposed 
by Loftsgaarden and Quesenbury (6) will also be examined. 

Section III will present the results of a study of a
Likelihood Ratio discrimination procedure in case 2) above
and a comparison of the performance of the various procedures
considered in this thesis when the random variables have the
univariate negative exponential distribution. In Section IV
conclusions and recommendations arising from this study will
be presented.
II. REVIEW OF LITERATURE



Notation and Definitions 

In considering the classification problem, the following 
structure will be assumed. The two categories or populations 
have distribution functions F and G , and without loss of 
generality, since the measures with cumulative distribution 
functions F and G are absolutely continuous with respect to 
that given by F + G, the density functions f and g will be 
supposed to exist. Random samples from the two distributions
are available: $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, independent and
identically distributed as F and as G respectively; they may
be used to obtain information about the respective distributions.
An observation z of the random variable Z is made,
and the classification problem is to decide whether Z is
distributed as F or as G. The abbreviation $Z \sim F$ should be
read "Z is distributed as F." The probabilities of misclassification
will be designated as

    $P_1 = \Pr\{\text{assign } Z \sim G \mid Z \sim F\}$

    $P_2 = \Pr\{\text{assign } Z \sim F \mid Z \sim G\}$

In the case that the distributions are negative exponential,

    $F(x) = 1 - e^{-\lambda x}$ and $G(y) = 1 - e^{-\mu y}$.

Throughout this thesis reference will be made to discrimination
procedures which tend to behave similarly in the limit;
that is, as the number of sample observations upon which they
are based grows very large. This concept may be made explicit
by introducing two notions of consistency defined by Fix and
Hodges (2):
Definition 1:

The sequences of decision functions $\{\Delta'_{m,n}\}$ and $\{\Delta''_{m,n}\}$ are
said to be consistent in the sense of performance characteristics
if, whatever be the true distributions of the random
variables, for any $\epsilon > 0$ there exists N so that if m > N and
n > N

    $|\Pr\{\Delta'_{m,n} = \delta_i\} - \Pr\{\Delta''_{m,n} = \delta_i\}| < \epsilon$

for every possible decision $\delta_i$.

Definition 2:

The sequences of decision functions $\{\Delta'_{m,n}\}$ and $\{\Delta''_{m,n}\}$ are
said to be consistent in the sense of decision functions if,
whatever be the true distributions of the random variables,
for any $\epsilon > 0$ there exists N so that if m > N and n > N

    $\Pr\{\Delta'_{m,n} = \Delta''_{m,n}\} > 1 - \epsilon$.

It is clear that consistency in the second sense implies
that in the first. All proofs of consistency by Fix and
Hodges and those in this thesis provide consistency in the
stronger sense. The modifying phrase will however be omitted.

Discrimination when the distributions are completely known

When the two distributions F and G are completely known,
the problem of assigning an observation z to one of the two
may be posed as a test of the hypothesis $Z \sim F$ against the
alternative $Z \sim G$. In this case, the Neyman-Pearson Lemma
gives the procedure: Assign Z as distributed according to
F if

    $\frac{f(z)}{g(z)} > t$

where t is to be determined, $0 \le t \le \infty$.

Assign $Z \sim F$ with probability $\gamma$ if

    $\frac{f(z)}{g(z)} = t$.

Otherwise assign $Z \sim G$. This procedure is optimal in that for
any assigned probability of error "of the first kind," i.e.,
$\Pr\{\text{assign } Z \sim G \mid Z \sim F\} = P_1$, the probability of error "of the
second kind," i.e., $\Pr\{\text{assign } Z \sim F \mid Z \sim G\} = P_2$, of this
procedure is no greater than that of any other. The value of
t is chosen in the classical hypothesis-testing problem so
that the probability of error of the first kind is some
chosen value. Since the class of Neyman-Pearson tests is
equivalent to the class of Bayes tests, the above procedure
(for the appropriate choice of t) is also optimal with respect
to minimizing any given weighted sum of the two error
probabilities.

This procedure will be designated L(t). In the case
that F and G are negative-exponential distributions, the L(t)
procedure is:

    Assign $Z \sim F$ if and only if $\frac{\lambda}{\mu} e^{(\mu - \lambda) z} \ge t$.
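
The rule L(t) for known exponential parameters can be sketched in a few lines. This is an illustration only, not code from the thesis; the function name `assign_known` and the string labels "F" and "G" are assumptions made here, with `lam` and `mu` standing for the rate parameters of F and G.

```python
import math

def assign_known(z, lam, mu, t=1.0):
    """Likelihood ratio rule L(t) for exponential F (rate lam) and G (rate mu).

    Assigns z to F when f(z)/g(z) = (lam/mu) * exp((mu - lam) * z) >= t.
    Names and labels are illustrative, not taken from the source.
    """
    ratio = (lam / mu) * math.exp((mu - lam) * z)
    return "F" if ratio >= t else "G"

# With lam = 1 and mu = 3, F has the heavier tail, so large z goes to F.
print(assign_known(0.1, 1.0, 3.0))
print(assign_known(2.0, 1.0, 3.0))
```

With t = 1 the boundary between the two decisions sits where the densities cross, at z = ln(mu/lam)/(mu - lam).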

Discrimination when the distributions are completely unknown

When nothing can be assumed about the form of the distributions
corresponding to the two populations, the statistician has
only the observations $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from which to
obtain information enabling him to classify Z appropriately.

The procedures which Fix and Hodges (2) suggest involve the
estimation of the densities f and g at the point of interest,
and the use of these estimates in the likelihood ratio
procedure. The following theorem due to Fix and Hodges demonstrates
the asymptotic optimality of this procedure.

Theorem 1: Let $\hat f$ and $\hat g$ denote estimates of the densities f
and g respectively and let $L^*(t;\hat f,\hat g)$ denote the likelihood
ratio discrimination procedure using $\hat f$ and $\hat g$ in place of f
and g. If $\hat f_{m,n}(z)$ and $\hat g_{m,n}(z)$ are consistent estimates for
f(z) and g(z) for all z except possibly for $z \in N_{f,g}$ where
$P_F(N_{f,g}) = 0 = P_G(N_{f,g})$, then $L^*_{m,n}(t;\hat f,\hat g)$ is consistent with
L(t).

The problem, then, is reduced to that of finding consistent
estimates of the densities f and g. If the observation
space is reduced to one dimension by a non-negative transformation
$\rho$, such that $x_n \to x$ entails $\rho(x_n, x) \to 0$, and if,
further, for each z except possibly for a null set under both
the F and G distributions $\rho(X,z)$ and $\rho(Y,z)$ are random variables
with continuous densities not both zero at zero, then
given the observation z to be classified, the observations
$X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ may be replaced by $\rho(X_1,z), \ldots, \rho(X_m,z)$;
$\rho(Y_1,z), \ldots, \rho(Y_n,z)$ and the discrimination involves non-negative
univariate random variables. A consistent estimate
of the transformed densities is given by the following theorem
of Fix and Hodges.

Theorem 2: Let X and Y be non-negative. Let f and g be positive
and continuous at 0. Let k(m,n) be a positive, integer-valued
function such that $k(m,n) \to \infty$, $\frac{1}{m} k(m,n) \to 0$ and
$\frac{1}{n} k(m,n) \to 0$ as $m,n \to \infty$ with $\frac{m}{n} \to \theta \ne 0$ or $\infty$. Define

    U = k-th smallest value of the combined samples of X's and Y's

    M = number of X's $\le$ U

    N = number of Y's $\le$ U

then $\frac{M}{mU}$ is a consistent estimate for f(0) and $\frac{N}{nU}$ is a consistent
estimate for g(0).

The $L^*(t;\hat f,\hat g)$ procedure thus requires: Assign $Z \sim F$ if
and only if

    $\frac{\hat f}{\hat g} = \frac{M/m}{N/n} \ge t$.
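
A minimal sketch of the resulting discriminator for univariate data follows. It assumes the transformation is the absolute distance to z (one admissible choice, not stated in the source), and the function name `fix_hodges_assign` is hypothetical. The comparison is written without division so that N = 0 causes no error.

```python
def fix_hodges_assign(z, xs, ys, k, t=1.0):
    """Sketch of the Fix-Hodges rule with rho(x, z) = |x - z| (an assumption).

    U is the k-th smallest transformed value in the combined sample, M and N
    count the transformed X's and Y's not exceeding U, and z is assigned to F
    when (M/m) / (N/n) >= t.
    """
    m, n = len(xs), len(ys)
    dx = [abs(x - z) for x in xs]
    dy = [abs(y - z) for y in ys]
    u = sorted(dx + dy)[k - 1]
    big_m = sum(d <= u for d in dx)
    big_n = sum(d <= u for d in dy)
    # Cross-multiplied form of (M/m)/(N/n) >= t.
    return "F" if big_m * n >= t * big_n * m else "G"

xs = [1.0, 1.2, 5.0]
ys = [3.0, 3.1, 3.3]
print(fix_hodges_assign(1.1, xs, ys, k=1))
print(fix_hodges_assign(3.05, xs, ys, k=3))
```

With t = 1, m = n and k = 1 this reduces to assigning z to the population of its nearest neighbor.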

Performance of the Non-Parametric Discriminator with finite
samples

Fix and Hodges (3) continued the investigation of their
non-parametric discrimination procedure by examining its performance
for small samples where distributions are Normal
with identical covariance matrix; that is, under conditions in
which the linear discriminant function is known to be an optimal
procedure. The bulk of that investigation is for univariate
distributions with k (the total number of the available
samples used in the classification) equal to one. This is
the "Rule of Nearest Neighbor": classify $Z \sim F$ if and only if
z's nearest neighbor is an x. Fix and Hodges obtain the misclassification
probability for this procedure for a considerable
range of sample sizes and for distance between population
means of 1, 2 and 3 times the standard deviation. Limiting
error probabilities (as $m = n \to \infty$) are obtained for k = 1 and
k = 3 with distance between population means of 1 to 5 times
the standard deviation. Some results are obtained for bivariate
normal distributions and an estimate of the performance
of the discriminator for k > 3 is obtained. One very interesting
result of this investigation is that, regardless of
the underlying distributions, as $m = n \to \infty$ the two error probabilities
of the rule of nearest neighbor are equal and no
greater than one-half.

Hager (4) investigated the performance of the "rule of 
nearest neighbor" under the assumption that F and G were neg- 
ative exponential. He contrasted this with the performance 
under the same conditions, of the linear discriminant func- 
tion and obtained misclassification probabilities for a wide 
range of (equal) sample sizes and parameter values for the 
latter procedure when F and G were Gamma distributions of 
order 1 to 20. His results in the exponential case are in- 
cluded in Section III of this thesis. 

Loftsgaarden and Quesenbury (6) proposed an alternative
density estimator to that suggested by Fix and Hodges, which
is consistent and applicable in a Euclidean space of any
dimension. The procedure is: let j(m) be a sequence of integers
such that

    $\lim_{m \to \infty} j(m) = \infty$

    $\lim_{m \to \infty} \frac{j(m)}{m} = 0$.

To estimate the density at a point z, using a sample $x_1, \ldots, x_m$,
let $w_{(1)}, \ldots, w_{(j)}, \ldots, w_{(m)}$ be the transformed sample
$|x_1 - z|, \ldots, |x_m - z|$ ordered from smallest to largest. Let
$A_{w_{(j)},z}$ denote the volume (Lebesgue measure) of the hypersphere
of radius $w_{(j)}$ centered at z, then

    $\tilde f(z) = \frac{(j-1)/m}{A_{w_{(j)},z}}$

is a consistent estimate of the density f at the point z.

If the density g at z is similarly estimated based on
$y_1, \ldots, y_n$, denoting the transformed sample by $v_{(1)}, \ldots,
v_{(\ell)}, \ldots, v_{(n)}$ (where $\ell(n)$ is a sequence with the same characteristics
as j above), then by Theorem 1 the procedure
$L^*(t;\tilde f,\tilde g)$ which requires: assign $Z \sim F$ if and only if

    $\frac{(j-1)/m}{A_{w_{(j)},z}} \ge t \, \frac{(\ell-1)/n}{A_{v_{(\ell)},z}}$

is consistent with the procedure L(t) and hence asymptotically
optimal. Note that, if t = 1 and m = n, $j = \ell$, this procedure
is identical with the Fix-Hodges procedure with $k = j + \ell - 1$
since a majority of the k nearest neighbors of z are x's if
and only if $w_{(j)} < v_{(\ell)}$. In the general case, the procedures
$L^*(t;\hat f,\hat g)$ and $L^*(t;\tilde f,\tilde g)$ are quite similar but not identical.
The density estimate $\tilde f$ has applicability to problems other
than that of classification, while the estimate $\hat f$ is not so
versatile.
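
For univariate data the hypersphere of radius w is the interval of length 2w, so the estimator and the associated rule can be sketched as below. The function names `lq_density` and `lq_assign` are hypothetical, and the sketch assumes j and l are at least 2 and that no sample point coincides with z (otherwise the radius, and hence the volume, could be zero).

```python
def lq_density(z, sample, j):
    """Sketch of the nearest-neighbor density estimate at z in one dimension.

    Uses ((j - 1) / m) divided by the length of the interval centered at z
    whose radius is the j-th smallest distance |x_i - z|.
    """
    m = len(sample)
    w = sorted(abs(x - z) for x in sample)
    radius = w[j - 1]
    volume = 2.0 * radius  # one-dimensional "hypersphere"
    return ((j - 1) / m) / volume

def lq_assign(z, xs, ys, j, l, t=1.0):
    """Assign z to F when the estimated density ratio is at least t."""
    return "F" if lq_density(z, xs, j) >= t * lq_density(z, ys, l) else "G"

xs = [0.0, 1.0, 2.0, 3.0]
ys = [10.0, 11.0, 12.0, 13.0]
print(lq_density(1.5, xs, 2))
print(lq_assign(1.5, xs, ys, j=2, l=2))
```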

In their paper, Loftsgaarden and Quesenbury report a
small empirical study of the density estimator $\tilde f$ when the true
distributions are Uniform, negative exponential, and Normal.
Based on this study, they recommend that the sequence j(n)
take values not less than $n^{1/2}$.
In an article published in 1967, Cover and Hart (1)
evaluated the rule of nearest neighbor in a slightly different
context from that in which the previous investigations had
placed it. Their work is in a Bayesian context so that there
is a probability structure over the space {F,G}

    $\eta_1 = \Pr\{Z \sim F\}$

    $\eta_2 = \Pr\{Z \sim G\}$

It is assumed also that the random sample of X's and Y's arises
in a way so that there is one fixed sample size with the number
of X's within that sample being probabilistically determined.

If the classification loss function simply counts wrong
decisions, i.e., the loss is 0 or 1 depending on whether the
observation to be classified is assigned correctly or incorrectly;
if R* designates the expected risk of the Bayes procedure
with respect to a given prior distribution $(\eta, 1-\eta)$ where
$\eta = \Pr\{Z \sim F\}$ and if R designates the expected risk (with respect
to the same prior distribution) of the rule of nearest
neighbor, then the result for discrimination between two
populations proved by Cover and Hart is given by the following:

Theorem 3: Let the space of possible values of the random
variables be a separable metric space. Let f and g be such
that, with probability one, x is either 1) a continuity point
of f and g, or 2) a point of non-zero probability measure.
Then the expected risk R of the nearest neighbor procedure
has the bounds

    $R^* \le R \le 2R^*(1 - R^*)$

These bounds are as tight as possible.

A comparable bound is obtained for the case of discrimination
among several populations.
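
As a small worked illustration of the two-population bound (the helper name `nn_risk_bounds` is hypothetical), the interval for the nearest neighbor risk can be computed directly from the Bayes risk:

```python
def nn_risk_bounds(bayes_risk):
    """Return (lower, upper) bounds R* <= R <= 2 R* (1 - R*) on the
    limiting nearest neighbor risk, given the Bayes risk R*."""
    return bayes_risk, 2.0 * bayes_risk * (1.0 - bayes_risk)

print(nn_risk_bounds(0.25))
```

Since 2R*(1 - R*) never exceeds one-half for R* between 0 and 1/2, the upper bound agrees with the earlier remark that the limiting error of the rule of nearest neighbor is no greater than one-half.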
III. A LIKELIHOOD RATIO DISCRIMINANT



As was noted in the last section, when the probability
structure of the two populations to be discriminated is known
completely the likelihood ratio criterion gives the solution
to the classification problem: that is, classify z as distributed
according to F if

    $\frac{f(z)}{g(z)} \ge t$ for some t, $0 \le t \le \infty$.

The procedure which Fix and Hodges selected with which to compare
the rule of nearest neighbor was the linear discriminant
function, since that procedure is known to be optimal under
the assumption that the populations under consideration are
Normally distributed with the same covariance matrix. Investigation
of the linear discriminant reveals that it is the
likelihood ratio procedure using the estimates of the population
means and the common covariance matrix as though they
were known to be correct. Hager's investigation indicated
that the use of the linear discriminant when the populations
have the negative exponential distribution can give very poor
results and that, in general, the probability of misclassification
is divided very unevenly between $P_1$ and $P_2$. It is not
surprising that the linear discriminant performs poorly on
distributions so radically different from the Normal as the
negative exponential. In fact, good performance in this case
would be quite surprising.

In attempting to discover a parametric discrimination
procedure with good properties, one might emulate the development
which leads to the discriminant function and suggest that
the random samples from the two populations be used to estimate
the parameters of the distributions. The likelihood ratio
procedure could then be carried out as though the estimates
were known to be correct. This procedure, which will be
designated $L(t;\hat\lambda,\hat\mu)$, would then be:

Let

    $\hat\lambda = \frac{m}{\sum_{i=1}^{m} x_i}$,   $\hat\mu = \frac{n}{\sum_{i=1}^{n} y_i}$.

Assign $Z \sim F$ if

    $\frac{\hat\lambda}{\hat\mu} e^{(\hat\mu - \hat\lambda) z} \ge t$ for some t, $0 \le t \le \infty$.

One may easily verify that this procedure is, indeed,
asymptotically optimal. Since $\hat\lambda \to \lambda$ and $\hat\mu \to \mu$ in probability as $n,m \to \infty$,
this result follows from Theorem 4 below, or from a more
general theorem of Hoel and Peterson (5).

Theorem 4 (Fix and Hodges): If

a) the estimates $\{\hat\theta_{m,n}\}$ are consistent and

b) for every $\theta$, $f_\theta(z)$ and $g_\theta(z)$ are continuous functions
of $\theta$ for every z except perhaps for $z \in N_\theta$ where
$\Pr(N_\theta) = 0$ under the distribution given by $f_\theta$ and that given
by $g_\theta$, then the sequence of discrimination procedures obtained
by applying the likelihood ratio principle with critical
value t > 0 to $f_{\hat\theta_{m,n}}(z)$ and $g_{\hat\theta_{m,n}}(z)$ is consistent with
L(t).

It is noteworthy that the foregoing procedure (and the
linear discriminant function as well) makes no use of the
observation z in determining the estimates of the parameters.
One might suppose that the use of z for this purpose would improve
the performance of the procedure, at least for small
sample sizes. Accordingly one could pose the problem as one
of testing the composite hypothesis $H_0$: $Z \sim F$ against the
alternative $H_1$: $Z \sim G$, using the maximum likelihood estimates
$\tilde\lambda$ and $\tilde\mu$ in both cases so that

    $\tilde\lambda = \frac{m+1}{\sum_{i=1}^{m} x_i + z}$,   $\tilde\mu = \frac{n+1}{\sum_{i=1}^{n} y_i + z}$.

Accept $H_0$ if

    $\frac{\tilde\lambda}{\tilde\mu} e^{(\tilde\mu - \tilde\lambda) z} \ge t$.

This procedure, which will be called $L(t;\tilde\lambda,\tilde\mu)$, is, of course,
asymptotically equivalent to $L(t;\hat\lambda,\hat\mu)$, so that it too is consistent
with L(t) and hence optimal in the limit.
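
Both estimated-parameter procedures can be sketched with a single function. The name `assign_plugin` and the flag `include_z` are illustrative assumptions: with `include_z=False` the estimates are m/sum(x) and n/sum(y), and with `include_z=True` the observation z is added to both sums and both counts, as in the second procedure.

```python
import math

def assign_plugin(z, xs, ys, t=1.0, include_z=False):
    """Sketch of the likelihood ratio rule with estimated exponential rates.

    include_z=False: rates estimated from the samples alone.
    include_z=True:  z is pooled into each sample before estimating.
    """
    m, n = len(xs), len(ys)
    if include_z:
        lam = (m + 1) / (sum(xs) + z)
        mu = (n + 1) / (sum(ys) + z)
    else:
        lam = m / sum(xs)
        mu = n / sum(ys)
    ratio = (lam / mu) * math.exp((mu - lam) * z)
    return "F" if ratio >= t else "G"

xs = [2.0, 4.0]   # sample from F, estimated rate 1/3
ys = [0.5, 0.5]   # sample from G, estimated rate 2
print(assign_plugin(3.0, xs, ys))
print(assign_plugin(0.1, xs, ys, include_z=True))
```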

In the discussion up to this point, the problem of the 
choice of t in the two families of procedures which have been 
proposed has not been considered. The following lemma will 
clarify the problem. 



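Lemma 1: If t is restricted to be a constant in the procedure
$L(t;\hat\lambda,\hat\mu)$ or $L(t;\tilde\lambda,\tilde\mu)$ as $\lambda$, $\mu$ range over the parameter space,
then if $t \ne 1$, as $m,n \to \infty$ for any $\epsilon > 0$ there exists $\delta$ so that
if $|1 - \frac{\mu}{\lambda}| < \delta$, $P_1 > 1 - \epsilon$ or $P_2 > 1 - \epsilon$.

Proof: The procedure $L(t;\hat\lambda,\hat\mu)$ requires: assign $Z \sim F$ iff

    $\frac{\hat\lambda}{\hat\mu} e^{(\hat\mu - \hat\lambda) z} \ge t$

or

    $(\hat\mu - \hat\lambda) z \ge \ln\frac{\hat\mu}{\hat\lambda} + \ln t$.

Let $m,n \to \infty$ so that $\hat\lambda \to \lambda$, $\hat\mu \to \mu$ in probability and suppose (without loss of
generality) that $\lambda < \mu$. Then the procedure assigns $Z \sim G$ incorrectly
if and only if

    $z < \frac{\ln(\mu/\lambda)}{\mu - \lambda} + \frac{\ln t}{\mu - \lambda}$.

Now suppose t > 1; since

    $P_1 = \Pr\{\text{assign } Z \sim G \mid Z \sim F\}$

        $= \Pr\left\{ Z < \frac{\ln(\mu/\lambda)}{\mu - \lambda} + \frac{\ln t}{\mu - \lambda} \right\}$

        $= 1 - \exp\left\{ -\lambda \, \frac{\ln(\mu/\lambda) + \ln t}{\mu - \lambda} \right\}$

        $= 1 - \exp\left\{ -\frac{\ln(\mu/\lambda) + \ln t}{\mu/\lambda - 1} \right\}$

the desired inequality is achieved if

    $\exp\left\{ -\frac{\ln(\mu/\lambda) + \ln t}{\mu/\lambda - 1} \right\} < \epsilon$

or

    $\left( \frac{\lambda}{\mu} \cdot \frac{1}{t} \right)^{1/(\mu/\lambda - 1)} < \epsilon$.

Since, by assumption, $\frac{\mu}{\lambda} > 1$ and t > 1, there exists $\delta > 0$ so
that if $\frac{\mu}{\lambda} - 1 < \delta$ the inequality above is satisfied and the
desired conclusion follows. If t < 1 a similar argument shows
that for appropriate values of $\frac{\mu}{\lambda}$, $P_2 > 1 - \epsilon$. Since $L(t;\tilde\lambda,\tilde\mu)$
is asymptotically equivalent to $L(t;\hat\lambda,\hat\mu)$ the result follows
for the former procedure as well.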
It is noteworthy that for t = 1 as $m,n \to \infty$ (with $\lambda < \mu$)

    $\lim_{\mu/\lambda \to 1} P_1 = \lim_{\mu/\lambda \to 1} \left[ 1 - \exp\left\{ -\frac{\ln(\mu/\lambda)}{\mu/\lambda - 1} \right\} \right] = 1 - e^{-1}$

and similarly

    $\lim_{\mu/\lambda \to 1} P_2 = e^{-1}$.

(The subscripts on the P's are reversed as $\mu/\lambda \to 1$ from below.)

In fact, it is easily verified that, for t = 1 as $m,n \to \infty$

is not a desirable situation; however, it is better than the
situation which obtains in the use of the linear discriminant
function where, as Hager discovered, for $.3863 = [2(\ln 2) - 1]
< \mu/\lambda < 1/[2(\ln 2) - 1] = 2.589$, $P_1 > \frac{1}{2}$ or $P_2 > \frac{1}{2}$. Recall that
the rule of nearest neighbor has both error probabilities
bounded above by $\frac{1}{2}$ as $m,n \to \infty$ irrespective of the population
distributions.
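
The limiting error probabilities discussed here can be checked numerically. The sketch below assumes the parameters are known (the limit of large samples) and that lam < mu; the function name `limiting_errors` is hypothetical. It uses the threshold z0 = [ln(mu/lam) + ln t]/(mu - lam), below which an observation is assigned to G.

```python
import math

def limiting_errors(lam, mu, t=1.0):
    """Return (P1, P2) for L(t) with known exponential rates lam < mu.

    P1 = Pr{Z < z0 | Z ~ F} and P2 = Pr{Z >= z0 | Z ~ G}.
    """
    if not lam < mu:
        raise ValueError("this sketch assumes lam < mu")
    z0 = (math.log(mu / lam) + math.log(t)) / (mu - lam)
    z0 = max(z0, 0.0)  # Z is non-negative, so a negative threshold acts as 0
    p1 = 1.0 - math.exp(-lam * z0)
    p2 = math.exp(-mu * z0)
    return p1, p2

print(limiting_errors(1.0, 2.0))           # well-separated rates, t = 1
print(limiting_errors(1.0, 1.0001))        # rates nearly equal, t = 1
print(limiting_errors(1.0, 1.0001, t=2))   # rates nearly equal, t > 1
```

With nearly equal rates the t = 1 errors approach 1 - 1/e and 1/e, while a constant t > 1 drives P1 toward one, which is the behavior Lemma 1 describes.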

The above results are asymptotic and imply little about
the performance of the procedures for small samples. They do,
however, sharpen the problem which must be faced in using the
$L(t;\hat\lambda,\hat\mu)$ procedure. Either t is fixed at 1 (for if $t \ne 1$ the
procedure may become arbitrarily bad as $m,n \to \infty$) or t is made
a function of the observations. If the latter course is elected,
one might be interested in preventing the possibility of
misclassifying an observation with higher probability than
one-half. A plausible way to pursue this goal would be to
seek a minimax procedure; i.e., one which would make $P_1$ equal
to $P_2$. To do this one would, given the estimates $\hat\lambda$, $\hat\mu$, seek
$t = t(\hat\lambda,\hat\mu)$ so that

    $P_F\left\{ Z : \frac{\hat\lambda}{\hat\mu} e^{(\hat\mu - \hat\lambda) Z} < t \right\} = P_G\left\{ Z : \frac{\hat\lambda}{\hat\mu} e^{(\hat\mu - \hat\lambda) Z} > t \right\}$

and use this value of t for the discrimination. The performance
of this "minimax" procedure is reported in this thesis.
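
One way such a t might be found numerically is sketched below. This is not the thesis's algorithm: it assumes the two probabilities are evaluated under exponential distributions whose rates are the supplied values (in practice the estimates), and it solves for the equalizing threshold by bisection. The name `minimax_t` is hypothetical.

```python
import math

def minimax_t(lam, mu, iters=200):
    """Sketch: critical value t that equalizes the two error probabilities
    when F and G are taken to be exponential with rates lam and mu.

    Writing a = min(lam, mu) and b = max(lam, mu), the equalizing threshold
    z0 solves 1 - exp(-a * z0) = exp(-b * z0); then
    t = (lam / mu) * exp((mu - lam) * z0).
    """
    if lam == mu:
        return 1.0
    a, b = min(lam, mu), max(lam, mu)
    lo, hi = 0.0, math.log(2.0) / a   # the root lies in this interval
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if 1.0 - math.exp(-a * mid) < math.exp(-b * mid):
            lo = mid
        else:
            hi = mid
    z0 = 0.5 * (lo + hi)
    return (lam / mu) * math.exp((mu - lam) * z0)

print(minimax_t(1.0, 2.0))
print(minimax_t(2.0, 1.0))
```

Interchanging the two rates returns the reciprocal value of t.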

In the foregoing material, the ratio $\frac{\lambda}{\mu}$ has occurred frequently.
It would be desirable for a discrimination procedure
to depend on the parameters of the distributions only through
this ratio. Indeed this is true, for both $L(t;\hat\lambda,\hat\mu)$ and
$L(t;\tilde\lambda,\tilde\mu)$.

Theorem 5: In the procedures $L(t;\hat\lambda,\hat\mu)$ and $L(t;\tilde\lambda,\tilde\mu)$,
$P_1 = \Pr\{\text{assign } Z \sim G \mid Z \sim F\}$ depends on $\lambda$, $\mu$ only through $c = \frac{\lambda}{\mu}$.

A lemma will be established first:

Lemma 2: If X has the negative exponential distribution with
parameter $\lambda$, then X is distributed as $(-\ln U)/\lambda$ where U has
the Uniform (0,1) distribution.

Proof of lemma:

    $\Pr\{X \le x\} = F(x) = 1 - e^{-\lambda x}$

    $\Pr\left\{ \frac{-\ln U}{\lambda} \le x \right\} = \Pr\{\ln U \ge -\lambda x\}$

        $= \Pr\{U \ge e^{-\lambda x}\}$

        $= 1 - e^{-\lambda x}$

The result follows by the Caratheodory extension theorem.
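
Lemma 2 is the basis of the usual inverse-transform sampler for the exponential distribution. A minimal sketch follows; the function name `exp_variate` is an assumption, and `1 - rng.random()` is used so that the uniform value lies in (0, 1] and the logarithm is always defined.

```python
import math
import random

def exp_variate(lam, rng):
    """Draw one exponential variate with rate lam as -ln(U) / lam."""
    u = 1.0 - rng.random()   # uniform on (0, 1], avoids log(0)
    return -math.log(u) / lam

rng = random.Random(0)
draws = [exp_variate(2.0, rng) for _ in range(100000)]
print(sum(draws) / len(draws))   # should be near the mean 1/lam = 0.5
```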
Proof of Theorem 5

Suppose n = m = 1. Then for the procedure $L(t;\hat\lambda,\hat\mu)$

    $\Pr\{\text{assign } Z \sim G \mid Z \sim F\} = \Pr\left\{ \frac{Y}{X} \exp\left[ \left( \frac{1}{Y} - \frac{1}{X} \right) Z \right] < t \,\Big|\, Z \sim F \right\}$

        $= \Pr\left\{ \frac{\lambda \ln V}{\mu \ln U} \exp\left[ \left( \frac{\mu}{\ln V} - \frac{\lambda}{\ln U} \right) \frac{\ln W}{\lambda} \right] < t \right\}$

        $= \Pr\left\{ c \, \frac{\ln V}{\ln U} \exp\left[ \left( \frac{1/c}{\ln V} - \frac{1}{\ln U} \right) \ln W \right] < t \right\}$

where U, V, and W are independent and identically
uniformly distributed on (0,1) by Lemma 2.

Similarly for $L(t;\tilde\lambda,\tilde\mu)$,

    $\Pr\{\text{assign } Z \sim G \mid Z \sim F\} = \Pr\left\{ \frac{Y+Z}{X+Z} \exp\left[ \left( \frac{2}{Y+Z} - \frac{2}{X+Z} \right) Z \right] < t \,\Big|\, Z \sim F \right\}$

        $= \Pr\left\{ \frac{c \ln V + \ln W}{\ln U + \ln W} \exp\left[ 2(-\ln W) \left( \frac{1}{\ln U + \ln W} - \frac{1}{c \ln V + \ln W} \right) \right] < t \right\}$

The result for arbitrary m,n follows by induction.

Note that

    $P_2 = \Pr\{\text{assign } Z \sim F \mid Z \sim G\}$

        $= \Pr\left\{ \frac{\hat\lambda}{\hat\mu} \exp[(\hat\mu - \hat\lambda) Z] > t \,\Big|\, Z \sim G \right\}$

        $= \Pr\left\{ \frac{\hat\mu}{\hat\lambda} \exp[(\hat\lambda - \hat\mu) Z] < 1/t \,\Big|\, Z \sim G \right\}$

is equal to $P_1$ for the situation in which $\lambda$ and $\mu$ have been
interchanged and t replaced by 1/t, i.e., $P_2$ for $L(t;\hat\lambda,\hat\mu)$ equals
$P_1$ for $L(1/t;\hat\mu,\hat\lambda)$. A similar statement is valid for $L(t;\tilde\lambda,\tilde\mu)$.

In seeking the error probabilities of the procedures
$L(t;\hat\lambda,\hat\mu)$ and $L(t;\tilde\lambda,\tilde\mu)$ one must calculate (with the appropriate
estimates in place of $\lambda$ and $\mu$)

    $P_1 = P\left\{ \frac{\lambda}{\mu} e^{(\mu - \lambda) Z} < t \,\Big|\, Z \sim F \right\}$

        $= P\left\{ \ln\frac{1}{\mu} - \ln\frac{1}{\lambda} + (\mu - \lambda) Z < \ln t \,\Big|\, Z \sim F \right\}$

where in procedure $L(t;\hat\lambda,\hat\mu)$, $\frac{m}{\hat\lambda} \sim$ Gamma $(\lambda, m)$, $\frac{n}{\hat\mu} \sim$ Gamma $(\mu, n)$,
$Z \sim$ Gamma $(\lambda, 1)$, and in procedure $L(t;\tilde\lambda,\tilde\mu)$, $\frac{m+1}{\tilde\lambda} = U + Z$ where
$U \sim$ Gamma $(\lambda, m)$, $Z \sim$ Gamma $(\lambda, 1)$ so that $U + Z \sim$ Gamma $(\lambda, m+1)$;
$\frac{n+1}{\tilde\mu} = V + Z$ where $V \sim$ Gamma $(\mu, n)$. In the $L(t;\hat\lambda,\hat\mu)$ procedure,
if t is a constant, it appears that $P_1$ should be calculable
by a straightforward triple numerical integration. In the
$L(t;\tilde\lambda,\tilde\mu)$ procedure the boundary of the region of integration
for Z involves the solution of a transcendental equation, but
this too may be done numerically and $P_1$ calculated for fixed
t. However, when t is permitted to be a function of the observations,
the integral becomes intractable. For this reason,
and because the investigator wished to compare the performance
of the Likelihood Ratio procedures to that of the Loftsgaarden-
Quesenbury procedure, which is almost impossible to assess
analytically, the decision was made to conduct this investigation
through a Monte-Carlo study. The following procedures
were investigated:



1) $L(t;\tilde\lambda,\tilde\mu)$, t = 1

2) $L(t;\hat\lambda,\hat\mu)$, t = 1

3) $L(t;\tilde\lambda,\tilde\mu)$, "minimax"

4) $L(t;\hat\lambda,\hat\mu)$, "minimax"

5) Rule of nearest neighbor

6) Loftsgaarden-Quesenbury procedure $L^*(t;\tilde f,\tilde g)$
The computer program, run on an IBM 360 computer, generated,
by means of the probability integral transform, the random
sample of X's and Y's, and the observation Z to be classified.
The various classification procedures were performed
and correct or incorrect classification of z was recorded.

The Monte Carlo procedure may be viewed as an attempt to
estimate the parameter p of a Bernoulli random variable; i.e.,
the probability with which a randomly selected observation will
be misclassified. As such, the distribution of the estimates
which have been obtained may be estimated. Since p is reasonably
close to one-half in all cases, and since 10,000 replications
of the Monte Carlo procedure were summed, it may be
assumed that the estimate

    $\hat p = \frac{\sum_{i=1}^{10,000} B_i}{10,000}$

where $B_i$ = 0 with probability (1-p), 1 with probability p,
has approximately the Normal distribution with mean p and variance
$\frac{p(1-p)}{10,000} \le .25 \times 10^{-4}$. Hence a 95% confidence interval
may be formed for the value of p in each case

    $.95 = \Pr\{|\hat p - p| \le 1.96\sigma\}$

        $\le \Pr\{|\hat p - p| \le .0098\}$.

For comparison with these results, the analytically computed
misclassification probabilities of the rule of nearest
neighbor and linear discriminant function obtained by Hager
are reproduced.
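
A present-day sketch of such a Monte Carlo experiment is given below. It is not the original IBM 360 program. It assumes equal sample sizes n, the procedure with estimates from the samples alone and t = 1, and an equal-prior choice of the population of Z, so the quantity estimated is the expected risk rather than the separate error probabilities. The function name `mc_error` and its parameters are hypothetical. The decision is taken on the logarithmic scale to avoid overflow when an estimated rate is very large.

```python
import math
import random

def mc_error(lam, mu, n, reps, seed=0):
    """Estimate the expected risk (prior 1/2, 1/2) of the likelihood ratio
    rule with estimated rates and t = 1, plus a conservative 95% half-width.
    """
    rng = random.Random(seed)

    def draw(rate):
        # Probability integral transform: -ln(U) / rate with U in (0, 1].
        return -math.log(1.0 - rng.random()) / rate

    errors = 0
    for _ in range(reps):
        lam_hat = n / sum(draw(lam) for _ in range(n))
        mu_hat = n / sum(draw(mu) for _ in range(n))
        from_f = rng.random() < 0.5
        z = draw(lam if from_f else mu)
        # ln[(lam_hat/mu_hat) * exp((mu_hat - lam_hat) z)] >= ln 1 = 0
        says_f = math.log(lam_hat / mu_hat) + (mu_hat - lam_hat) * z >= 0.0
        errors += (says_f != from_f)
    p_hat = errors / reps
    half_width = 1.96 * math.sqrt(0.25 / reps)   # uses p(1 - p) <= 1/4
    return p_hat, half_width

p_hat, half_width = mc_error(1.0, 5.0, n=10, reps=10000, seed=1)
print(p_hat, half_width)
```

With 10,000 replications the half-width is 1.96 x .005 = .0098, matching the interval quoted in the text.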
Table 1

Misclassification Error Probabilities for Procedures

Description of Procedures:

1. $L(t;\tilde\lambda,\tilde\mu)$, t = 1
2. $L(t;\hat\lambda,\hat\mu)$, t = 1
3. $L(t;\tilde\lambda,\tilde\mu)$, "minimax"
4. $L(t;\hat\lambda,\hat\mu)$, "minimax"
5. $L^*(t;\hat f,\hat g)$, t = 1, "Rule of Nearest Neighbor"
6. $L^*(t;\tilde f,\tilde g)$, t = 1, Loftsgaarden and Quesenbury Procedure,
   $j(n) = \ell(n) = n^{1/2}$
7. "Rule of Nearest Neighbor" - from Hager (4)
8. Linear Discriminant Function - from Hager (4)

C = $\lambda/\mu$
N = size of sample from each population upon which classification
    procedure is based
Table 2

EXPECTED RISK FOR ALL PROCEDURES WITH PRIOR (.5, .5)

                          Procedure
   C    N     1     2     3     4     5     6     7     8

 1.5    1  .488  .491  .495  .495  .487  .487  .488  .488
        2  .482  .483  .481  .483  .488  .488  .486  .480
        5  .470  .471  .476  .475  .487  .480  .484  .467
       10  .453  .454  .455  .454  .483  .478  .483  .455
       20  .444  .443  .448  .448  .480  .470  ****  .442
        ∞  .426  .426  .430  .430  ****  .426  .481  .426

 2.0    1  .465  .467  .462  .466  .468  .468  .467  .467
        2  .448  .450  .451  .449  .460  .460  .461  .447
        5  .410  .410  .417  .414  .456  .444  .456  .416
       10  .393  .392  .401  .400  .453  .427  .453  .394
       20  .377  .377  .381  .381  .453  .420  .452  .381
        ∞  .375  .375  .382  .382  ****  .375  .451  .375

 3.0    1  .419  .425  .427  .431  .425  .425  .424  .424
        2  .378  .380  .386  .386  .414  .414  .411  .385
        5  .333  .333  .337  .336  .401  .378  .401  .338
       10  .315  .316  .326  .326  .391  .361  .397  .319
       20  .311  .311  .324  .323  .398  .350  .395  .313
        ∞  .308  .308  .318  .318  ****  .308  .395  .309
Table 2 (Continued)

EXPECTED RISK FOR ALL PROCEDURES WITH PRIOR (.5, .5)

                          Procedure
   C    N     1     2     3     4     5     6     7     8

 5.0    1  .352  .358  .355  .366  .365  .365  .361  .361
        2  .289  .294  .304  .307  .340  .340  .338  .307
        5  .248  .250  .262  .262  .320  .296  .322  .264
       10  .235  .237  .249  .249  .314  .269  .319  .255
       20  .234  .235  .245  .245  .316  .258  .318  .253
        ∞  .233  .233  .245  .245  ****  .233  .319  .250

10.0    1  .258  .270  .259  .285  .285  .285  .286  .286
        2  .196  .206  .211  .224  .249  .249  .248  .235
        5  .166  .169  .180  .184  .227  .205  .226  .215
       10  .158  .160  .173  .175  .222  .181  .222  .213
       20  .155  .155  .167  .168  .222  .168  .221  .213
        ∞  .152  .152  .165  .165  ****  .152  .222  .214

20.0    1  .194  .208  .184  .228  .231  .231  .233  .233
        2  .133  .141  .140  .161  .184  .184  .184  .201
        5  .105  .107  .114  .124  .153  .149  .153  .198
       10  .098  .101  .111  .115  .145  .123  .146  .201
       20  .092  .093  .104  .106  .142  .105  .145  .202
        ∞  .094  .094  .106  .106  ****  .094  .145  .204
IV. SUMMARY AND CONCLUSIONS



A number of interesting facts are evident from inspection of the
results of the investigations conducted in this thesis. Perhaps the
most startling is that for values of c not greater than 5 and all
sample sizes up to and including 20 the expected risk (with prior
(1/2, 1/2)) of the linear discriminant function is uniformly smaller
than that for either of the non-parametric procedures (see Figure 1).
The linear discriminant is equivalent to procedure
$L(t;\hat{\lambda},\hat{\mu})$ with t chosen in a somewhat bizarre
fashion, since it divides the positive line into two intervals which
are acceptance regions for $\{Z \sim F\}$ and $\{Z \sim G\}$. Hence the
linear discriminant minimizes $P_2$ for the $P_1$ which it achieves,
and though the division of the total error probability is very uneven,
the average is small enough to better the non-parametric procedures.
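The way a likelihood ratio rule splits the positive line into two acceptance intervals can be sketched as follows. This is a minimal illustration, not the thesis's own definition: it assumes a procedure of the form L(t; ·, ·) assigns Z to F when the estimated likelihood ratio f(Z)/g(Z) is at least t, and that the rate estimates are reciprocals of the sample means. The function names and sample values are hypothetical.

```python
import math

def lr_cutoff(rate_f, rate_g, t=1.0):
    """Point z* on the positive line where f(z)/g(z) == t for exponential
    densities f (rate rate_f) and g (rate rate_g)."""
    if rate_f == rate_g:
        raise ValueError("equal rates: the likelihood ratio is constant in z")
    return math.log(rate_f / (rate_g * t)) / (rate_f - rate_g)

def classify(z, x_sample, y_sample, t=1.0):
    """Assign z to 'F' when the estimated likelihood ratio is at least t."""
    # Assumed plug-in estimates: rate = 1 / sample mean.
    rate_f = len(x_sample) / sum(x_sample)
    rate_g = len(y_sample) / sum(y_sample)
    ratio = (rate_f / rate_g) * math.exp(-(rate_f - rate_g) * z)
    return "F" if ratio >= t else "G"

# Hypothetical training samples (not data from the thesis).
x_sample = [1.0, 1.0]      # sample mean 1.0, estimated rate 1.0
y_sample = [0.25, 0.25]    # sample mean 0.25, estimated rate 4.0

print(lr_cutoff(1.0, 4.0))                  # cutoff between the two intervals
print(classify(1.0, x_sample, y_sample))    # large observation
print(classify(0.1, x_sample, y_sample))    # small observation
```

With these illustrative samples and t = 1 the cutoff is ln(4)/3, about 0.462. Observations above it fall in the acceptance region for F (the population with the larger mean) and observations below it in the region for G. Changing t simply moves the cutoff along the positive line.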

Also interesting is the fact that the expected risks of procedures
$L(t;\tilde{\lambda},\tilde{\mu})$ and $L(t;\hat{\lambda},\hat{\mu})$
are almost identical even for very small sample sizes. In general
$P_1$ is larger for $L(t;\tilde{\lambda},\tilde{\mu})$ than for
$L(t;\hat{\lambda},\hat{\mu})$ but $P_2$ for the latter procedure
is smaller so as to keep the average almost constant. The "minimax"
procedure appears to achieve the desired equalization of $P_1$ and
$P_2$ fairly well for moderate sample sizes (n ≥ 10), but fails quite
badly for n = 1 or 2. It appears that, for n > 5, the average risk is
not increased appreciably by using the "minimax" procedure.
Expected Risk vs. n for Selected Procedures;



The negligible improvement in the performance of the likelihood
ratio discriminant procedures for sample sizes in excess of 10 and
the extremely slow approach to optimality of the Loftsgaarden and
Quesenberry procedure are also interesting. An example of this for
c = 10 is shown in Figure 2.

The considerable disparity of the values of $P_1$ and $P_2$ for many
of the procedures considered in this thesis raises an interesting
philosophical point which an investigator should settle for himself
before selecting one of these methods for use. If, for example, one
is willing to accept the possibility that a large percentage of the
members of one population will be misclassified, although the average
number of misclassifications is apt to be moderate, then the use of
the linear discriminant function may be preferable to the use of the
non-parametric procedures (unless c is very large). If, however, one
is reassured by the fact that the rule of nearest neighbor makes
errors no more than half the time (asymptotically) no matter what the
situation, one may have a predilection for that procedure. The
superiority in terms of expected risk of the linear discriminant
function over the non-parametric procedures for small c is shown in
Figure 1 where, for example, for n = 2, c = 5 the linear discriminant
has expected risk about .03 lower than the rule of nearest neighbor;
for n = ∞, c = 5 the difference is almost .06. In fact, the
performance of the linear discriminant where c < 3 is almost
identical with that of the best procedure in this range,
$L(1;\tilde{\lambda},\tilde{\mu})$. However, reference to Figure 3
indicates that in the same cases, $P_2$ for the linear discriminant
is much greater than that for the rule of nearest neighbor. Also
apparent in Figure 3 is the non-monotonicity of the error probability
for several procedures. Table 1 gives both $P_1$ and $P_2$ for all
cases considered in this thesis so that expected risks for mixing
probabilities other than (1/2, 1/2) may be easily calculated.

Figure 3

For Selected Procedures; c = 5
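The calculation of expected risk for other mixing probabilities can be sketched as below. The sketch assumes that $P_1$ and $P_2$ are the misclassification probabilities for observations from the first and second population respectively, and that all misclassifications cost the same, so that the expected risk is the prior-weighted average of the two. The numerical values are hypothetical and are not taken from Table 1.

```python
def expected_risk(p1, p2, prior1):
    """Prior-weighted misclassification probability.

    p1, p2 : error probabilities for populations 1 and 2 (assumed meaning)
    prior1 : mixing probability of population 1; population 2 gets 1 - prior1
    """
    if not 0.0 <= prior1 <= 1.0:
        raise ValueError("prior1 must lie in [0, 1]")
    return prior1 * p1 + (1.0 - prior1) * p2

# Hypothetical error probabilities for one procedure (illustration only).
p1, p2 = 0.30, 0.10

print(round(expected_risk(p1, p2, 0.5), 3))   # equal mixing probabilities
print(round(expected_risk(p1, p2, 0.8), 3))   # population 1 four times as likely
```

With equal mixing probabilities the result is the simple average of the two error probabilities. When the population with the larger error probability is also the more likely one, the expected risk rises accordingly, which is why a very uneven split between $P_1$ and $P_2$ matters once the priors are unequal.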

The following recommendations seem appropriate based on this study.
If one can be reasonably certain that the populations are negative
exponential, and there is no reason to suppose that the unknown
observation is more likely to be from one of the populations than
from the other, the minimax version of
$L(t;\tilde{\lambda},\tilde{\mu})$ (Procedure 3) would be a good
choice if n ≥ 5. For smaller samples the same procedure with t = 1
(Procedure 1) seems better. If observations from one of the
populations are appreciably more likely than those from the other, a
procedure taking this fact into account by taking more observations
from the more likely population and/or estimating the probability of
occurrence of the populations (if these probabilities are not known)
should be considered. A selection of the parameter t in the chosen
procedure in order to minimize the expected risk with respect to the
estimated (or known) population probabilities could then be made.
Because the probability of classification error does not decrease
appreciably as n increases from ten to infinity for the likelihood
ratio procedures, it appears that the use of samples larger than ten
in Procedures 1 - 4 is unwarranted unless the cost of sampling is
very small. If one cannot be certain that the populations are
negative exponential, a choice between linear discriminant and a
non-parametric procedure may be appropriate. The attitude of the
experimenter toward the importance of $P_1$ and $P_2$ individually
should influence his decision in this case.
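The selection of t to minimize expected risk for given population probabilities can be illustrated in the limiting case where both exponential rates are known. The following is only a sketch under stated assumptions: the rule assigns Z to F when f(Z)/g(Z) is at least t, $P_1$ is the error probability for observations from F, and the rates, grid, and function names are hypothetical choices made for illustration.

```python
import math

def error_probs(rate_f, rate_g, t):
    """Return (P1, P2) for the rule 'assign Z to F when f(Z)/g(Z) >= t'.

    Both populations are exponential with known, unequal rates.
    P1 is the error probability under F, P2 the error probability under G.
    """
    cutoff = math.log(rate_f / (rate_g * t)) / (rate_f - rate_g)
    z = max(cutoff, 0.0)  # a non-positive cutoff leaves a single region
    if rate_f > rate_g:
        # F is accepted on [0, z]
        return math.exp(-rate_f * z), 1.0 - math.exp(-rate_g * z)
    # F is accepted on [z, infinity)
    return 1.0 - math.exp(-rate_f * z), math.exp(-rate_g * z)

def best_threshold(rate_f, rate_g, prior_f, grid):
    """Grid value of t with the smallest prior-weighted error probability."""
    def risk(t):
        p1, p2 = error_probs(rate_f, rate_g, t)
        return prior_f * p1 + (1.0 - prior_f) * p2
    return min(grid, key=risk)

grid = [k / 10 for k in range(1, 51)]   # candidate values t = 0.1, ..., 5.0

# Hypothetical rates 1 and 5, i.e. means in the ratio 5 to 1.
t_equal = best_threshold(1.0, 5.0, 0.5, grid)    # equal population probabilities
t_skewed = best_threshold(1.0, 5.0, 0.8, grid)   # F four times as likely as G
print(t_equal, t_skewed)
```

In this known-parameter setting the selected t equals 1 for equal population probabilities and moves below 1 when F is the more likely population, so that more of the positive line is assigned to F. With estimated parameters the error probabilities would instead have to be computed or simulated for the sample size in use.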